Abstract

We address the problem of automati- 
cally acquiring case frame patterns from 
large corpus data. In particular, we 
view this problem as the problem of 
estimating a (conditional) distribution 
over a partition of words, and propose a 
new generalization method based on the 
MDL (Minimum Description Length) 
principle. For efficiency, our method makes use of an existing
thesaurus and restricts its attention to those partitions that are
present as 'cuts' in the thesaurus tree, thus reducing the
generalization problem to that of estimating the 'tree cut models'
of the thesaurus. We then give an effi- 
cient algorithm which provably obtains 
the optimal tree cut model for the given 
frequency data, in the sense of MDL. 
We have used the case frame patterns 
obtained using our method to resolve 
pp-attachment ambiguity. Our experi- 
mental results indicate that our method 
improves upon or is at least as effective 
as existing methods. 
Keywords: Corpus-Based Language Processing, Natural Language
Learning, Case Frame, MDL Principle, Disambiguation

1 Introduction 

We address the problem of automatically acquir- 
ing case frame patterns from large corpus data. A 
satisfactory solution of this problem would have a 
great impact on various tasks in natural language 
processing, such as the disambiguation problem 
in parsing, a central problem in this field. The ac- 
quired knowledge would also be helpful for build- 
ing a lexicon, as it would provide lexicographers 
with word usage descriptions. 

The purpose of the present research is to pro- 
vide a method by which to acquire knowledge 
from limited data of observed case frames, which 
will allow us to judge the (degree of) acceptability of unseen
case frames. Such an acquisition procedure will necessarily
involve generalization of case frames available in the input data.
The acquisition process will thus consist of two 
phases: extraction of case frame instances from 
corpus data, and generalization of those instances 
to case frame patterns. For the extraction prob- 
lem, there have been various methods proposed 
to date, which are quite adequate (Brent 91; Hin- 
dle & Rooth 91; Grishman & Sterling 92; Man- 
ning 92; Smadja 93; Utsuro et al. 93). The gener- 
alization problem is a more challenging problem 
and has not been solved satisfactorily, although 
various methods have been proposed. Some of 
these methods make use of prior knowledge in the 
form of an existing thesaurus (Resnik 92; Resnik 
93; Framis 94; Almuallim et al. 94), and others 
do not rely on any prior knowledge (Hindle 90; 
Brown et al. 92; Pereira et al. 93; Grishman & 
Sterling 94; Tanaka 94). In this paper, we pro- 
pose a new generalization method which belongs 
to the first of these two categories. 

We formalize the problem of generalizing case 
slots as that of estimating a model of probabil- 
ity distribution over some partition of words, and 
propose a new generalization method based on 
the MDL (Minimum Description Length) prin- 
ciple: a well-motivated and theoretically sound 
principle of statistical estimation from informa- 
tion theory. We also devised an efficient algo- 
rithm which is guaranteed to output an optimal 
model in the sense of MDL, provided we have a 
reliable thesaurus at hand. Finally we empirically 
tested the performance of our method, by using 
the generalized case frame patterns obtained by 
training our method with corpus data to resolve 
pp-attachment ambiguity. Our experimental re- 
sults indicate that our method improves upon or 
is at least as effective as existing methods. 

2 The Problem Setting 

2.1 The Data Sparseness Problem 

Suppose available to us are frequency data of the type shown in
Figure 1, given by instances of a case frame automatically extracted
from a corpus using conventional techniques. (In the sequel, we will
refer to this type of frequency data as 'co-occurrence data.') The
problem of generalizing case slots can be viewed as the problem of
learning the underlying conditional distribution which gives rise to
such data. Such a conditional distribution specifies the conditional
probability 1

$$P(n \mid v, s) \tag{1}$$

for each n in the set of nouns $\mathcal{N} = \{n_1, n_2, \ldots, n_N\}$,
v in the set of verbs $\mathcal{V} = \{v_1, v_2, \ldots, v_V\}$, and s in
the set of slot names $\mathcal{R} = \{s_1, \ldots, s_R\}$. (Such a
probability model 2 is often referred to as a word-based model.) Since
the number of probability parameters in a word-based model is very
large ((N - 1) x V x R), a word-based model is difficult to estimate
with a reasonable data size that is available in practice, a problem
usually referred to as the 'data sparseness problem.' For example,
suppose that we employ the well-known Maximum Likelihood Estimator (or
MLE for short) to estimate the probability parameters of a word-based
model given frequency data in Figure 1. MLE is obtained by simply
normalizing the frequencies so that they sum to one, giving, for
example, the estimated probabilities of 0.0, 0.2, and 0.4 for
'swallow,' 'eagle,' and 'bird,' respectively. Since in general the
number of nouns exceeds the size of typically available data, MLE will
result in estimating most of the probability parameters to be zero. To
address this problem, Grishman & Sterling proposed a method of
smoothing the probabilities using a similarity measure between words,
which itself is calculated based on co-occurrence data (Grishman &
Sterling 94). That is, probability estimates of words are smoothed by
weighted averaging using the similarity measure as the weights. The
fact that this method relies on no prior information is an advantage,
but it also makes it difficult to find a generalization method that is
both efficient and theoretically sound. As an alternative, a number of
authors have proposed to use class-based models, in which the classes
(similarities) present in an existing thesaurus are used for the
purpose of smoothing estimated probabilities.

Figure 1: Frequency data for the subject slot of verb 'fly' (nouns:
swallow, crow, eagle, bird, bug, bee, insect)

ANIMAL
    BIRD
        swallow  crow  eagle  bird
    INSECT
        bug  bee  insect

Figure 2: An example thesaurus

1 Since the case slots in a case frame are in general not
independent, generalization of case frames involves generalization of
individual case slots, and learning of the dependencies that exist
between different case slots. In this paper we confine ourselves to
the former problem of generalizing case slots. (We will address the
latter issue in the near future.)

2 A representation of a (conditional) probability distribution is
usually called a probability model, or simply a model.
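
The data sparseness problem of Section 2.1 can be made concrete with a minimal sketch of MLE for a word-based model. The counts below are an assumption: the plotted values of Figure 1 are not available here, so they are reconstructed from the probabilities quoted in the text (0.0, 0.2 and 0.4 for 'swallow,' 'eagle' and 'bird') and from the totals f(BIRD) = 8 and f(bee) = 2 given later in the paper; the count for 'crow' is inferred from those totals. The function name `mle` is illustrative only.

```python
# Assumed counts for the subject slot of 'fly' (reconstructed, see lead-in).
FREQ = {"swallow": 0, "crow": 2, "eagle": 2, "bird": 4,
        "bug": 0, "bee": 2, "insect": 0}


def mle(freq):
    """Word-based MLE: normalize the frequencies so that they sum to one."""
    total = sum(freq.values())
    return {noun: count / total for noun, count in freq.items()}


probs = mle(FREQ)
for noun, p in probs.items():
    print(f"{noun:8s} {p:.1f}")

# Nouns never observed receive probability zero under MLE.
zero_nouns = [noun for noun, p in probs.items() if p == 0.0]
print("estimated as zero:", zero_nouns)
```

Even in this small example three of the seven nouns are estimated as impossible subjects of 'fly', which is the behaviour the class-based models are meant to correct.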



2.2 Class-Based Models 

An example of a class-based method is Resnik's 
method of generalizing case slots using a the- 
saurus and the so-called selectional association 
measure. The selectional association A(v,s,C) 
is defined as follows, 

$$A(v, s, C) = P(C \mid v, s) \times \log \frac{P(C \mid v, s)}{P(C)} \tag{2}$$

where C is a class of nouns present in a given 
thesaurus, and v,s are a verb and a slot name, 
as described earlier. In generalizing a given noun 
n using this method, the noun class C with the 
maximum A(v,s,C), among all super classes of 
n in a given thesaurus is output. This method 
is based on an interesting intuition, but its inter- 
pretation as a method of estimating probability 
distributions is yet to be determined. We pro- 
pose a class-based generalization method whose 
performance as a method of estimation is guar- 
anteed to be near optimal. 

In this paper, we define the class-based model 
in the following way. A class-based model consists of a partition of
the set of nouns $\mathcal{N}$, namely
$\Gamma \subseteq 2^{\mathcal{N}}$ such that
$\bigcup_{C_i \in \Gamma} C_i = \mathcal{N}$ and
$\forall C_i, C_j \in \Gamma\ (i \neq j),\ C_i \cap C_j = \emptyset$,
and a number of parameters specifying the conditional probability for
each C in that partition, namely

$$P(C \mid v, s). \tag{3}$$

Within a given class C, it is assumed that each noun is generated with
equal probability, i.e.,

$$\forall n \in C: \quad P(n \mid v, s) = \frac{1}{|C|} \times P(C \mid v, s) \tag{4}$$



Note that this assumption is basically equivalent 
to the assumption made in other class and sim- 
ilarity based methods (Hindle 90; Grishman & 
Sterling 94) that similar words occur in the same 
context with roughly equal likelihood. 

2.3 The Tree Cut Model 

Mainly for the consideration of computational tractability, we reduce
the number of possible partitions to consider by using an existing
thesaurus as prior knowledge. That is, we restrict our attention to
those partitions that exist within the thesaurus in the form of a cut
in the tree. Here we mean by a 'thesaurus' a tree in which each leaf
node stands for a noun, while each internal node represents a noun
class, and domination stands for set inclusion. (See Figure 2.) A cut
in a tree is any set of nodes in the tree that defines a partition of
the leaf nodes, viewing each node as representing the set of all the
leaf nodes it dominates. For example, in the thesaurus of Figure 2,
there are five cuts: [ANIMAL], [BIRD, INSECT], [BIRD, bug, bee,
insect], [swallow, crow, eagle, bird, INSECT], and [swallow, crow,
eagle, bird, bug, bee, insect]. The class of tree cut models of a
fixed thesaurus tree is then obtained by restricting the partition
$\Gamma$ in the definition of a class-based model to be those that are
present as a cut in that thesaurus tree. Formally, a tree cut model
can be represented by a pair consisting of a tree cut, and a
probability parameter vector of the same length. 3 For example, M =
([BIRD, bug, bee, insect], [0.8, 0, 0.2, 0]) is a tree cut model, 4
which is shown in Figure 4.

Recall that M defines a (conditional) probability distribution
$P_M(n \mid v, s)$ in the following way: For any word that is in the
tree cut, such as 'bee', the probability is given as explicitly
specified by the model, i.e. $P_M(\mathrm{bee} \mid \mathrm{fly}, \mathrm{arg1}) = 0.2$.
For any class in the tree cut, the probability is distributed
uniformly to all words dominated by it. For example, since there are
four words that fall under the class BIRD, and 'eagle' is one of them,
$P_M(\mathrm{eagle} \mid \mathrm{fly}, \mathrm{arg1}) = 0.8/4 = 0.2$.
Note that in this way, M 'smooths' the probabilities assigned to the
nouns under BIRD, even if they may have different observed
frequencies. If we use MLE for the parameter estimation, we can obtain
five tree cut models, three of which are shown in Figures 3-5, from
the co-occurrence data in Figure 1. We have thus formalized the
problem of generalizing a case slot as that of estimating a model from
the class of tree cut models for some fixed thesaurus tree, namely
selecting a model which best explains the data from among the class of
tree cut models. 5

Figure 3: A tree cut model with [swallow, crow, eagle, bird, bug, bee,
insect]

Figure 4: A tree cut model with [BIRD, bug, bee, insect]

Figure 5: A tree cut model with [BIRD, INSECT]

3 In general, a probability model consists of a discrete model and a
probability parameter vector. The tree cut is the discrete model in
this case. In the sequel, we sometimes use the discrete model (tree
cut) to refer to a tree cut model, when the values of the probability
parameters are clear from the context.

4 Note that the probability parameters in M were estimated using MLE
from the co-occurrence data in Figure 1, i.e. f(BIRD | fly, arg1) = 8,
f(bee | fly, arg1) = 2, others = 0.
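
The notions of a cut and of the distribution defined by a tree cut model can be illustrated with a short sketch on the thesaurus of Figure 2. The nested-dictionary representation and the function names (`leaves`, `cuts`, `leaf_sets`, `noun_probability`) are assumptions made for this illustration, not part of the method as published.

```python
from itertools import product

# Thesaurus of Figure 2: internal nodes map to their children, leaves to None.
THESAURUS = {
    "BIRD": {"swallow": None, "crow": None, "eagle": None, "bird": None},
    "INSECT": {"bug": None, "bee": None, "insect": None},
}
ROOT = "ANIMAL"


def leaves(name, children):
    """Leaf nouns dominated by a node (a leaf dominates itself)."""
    if children is None:
        return [name]
    return [leaf for c, sub in children.items() for leaf in leaves(c, sub)]


def cuts(name, children):
    """Every cut of the subtree rooted at `name`, as a list of node names."""
    if children is None:
        return [[name]]
    per_child = [cuts(c, sub) for c, sub in children.items()]
    # Either stop at this node, or combine one cut from each child subtree.
    return [[name]] + [sum(combo, []) for combo in product(*per_child)]


def leaf_sets(name, children, table=None):
    """Map every node name to the list of leaves it dominates."""
    table = {} if table is None else table
    table[name] = leaves(name, children)
    for c, sub in (children or {}).items():
        leaf_sets(c, sub, table)
    return table


def noun_probability(noun, cut, params, table):
    """P_M(noun): the class probability shared uniformly among its leaves."""
    for node, p in zip(cut, params):
        if noun in table[node]:
            return p / len(table[node])
    raise KeyError(noun)


ALL_CUTS = cuts(ROOT, THESAURUS)
TABLE = leaf_sets(ROOT, THESAURUS)
M = (["BIRD", "bug", "bee", "insect"], [0.8, 0.0, 0.2, 0.0])

for cut in ALL_CUTS:
    print(cut)
print(noun_probability("eagle", *M, TABLE))
```

The enumeration returns the five cuts listed in the text, and the model M assigns 'eagle' a quarter of the probability mass of BIRD.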

3 Generalization Method Based 
On MDL 

As our estimation strategy, we employ the MDL (Minimum Description
Length) principle (Rissanen 78; Rissanen 84; Rissanen 86). MDL is a
principle of data compression and statistical estimation from
information theory, which states that the best probability model for
given data is that which requires the least code length in bits for
the encoding of the model itself and the given data observed through
it 6. The former is called 'the model description length' and the
latter 'the data description length.'

5 There have been a number of methods proposed in the literature,
which will automatically construct the thesaurus itself using
co-occurrence data (Hindle 90; Brown et al. 92; Pereira et al. 93). In
this paper, we use an existing thesaurus for efficiency, although we
can extend (and we have extended) our method so as to automatically
construct a thesaurus.

6 We refer the interested reader to (Quinlan & Rivest 89) for an
introduction to the MDL principle.

In our current problem, it tends to be the case, 
in general, that a model near the root of the the- 
saurus tree, such as that in Figure 5, is simpler (in 
terms of the number of parameters), but tends to 
have a poorer fit to the data. In contrast, a model 
near the leaves of the thesaurus tree, such as that 
in Figure 3, is more complex, but tends to have 
a better fit to the data. In other words, there is 
a trade-off between the simplicity of a model and 
the goodness of fit to the data. While the model 
description length of MDL is an indicator of the 
former, the data description length is of the lat- 
ter. MDL claims that the model which minimizes 
the sum total of the description lengths should be 
the best model. 

In the remainder of this section, we will describe how we apply MDL
to our current problem. We will then discuss the rationale behind
using MDL in our present context.

3.1 Calculating the Description Length 

We first show how the description length for a model is calculated.
Given a tree cut model M and data S, its total description length 7
L(M) is computed as the sum of the model description length
$L_{mod}(M) + L_{par}(M)$ and the data description length
$L_{dat}(M)$. Namely,

$$L(M) = L_{mod}(M) + L_{par}(M) + L_{dat}(M) \tag{5}$$

$L_{mod}(M)$ is calculated by

$$L_{mod}(M) = \log |\mathcal{G}| \tag{6}$$

where $\mathcal{G}$ denotes the set of cuts in the tree T. This is
because if there are $|\mathcal{G}|$ cuts in the tree, then we need
$\log |\mathcal{G}|$ bits to describe each of the cuts (for further
explanation see (Quinlan & Rivest 89)). Throughout this paper 'log'
denotes the logarithm to the base 2. $L_{par}(M)$, often referred to
as the parameter description length, is calculated by

$$L_{par}(M) = \frac{K}{2} \times \log |S| \tag{7}$$

where K denotes the number of (free) parameters in the cut model, i.e.
K equals the number of nodes in the cut of M minus one. It is known to
be best to use $\log \sqrt{|S|} = \frac{\log |S|}{2}$ bits to describe
each of the parameters. 8 Finally, $L_{dat}(M)$ is calculated by

$$L_{dat}(M) = -\sum_{n \in S} \log P(n) \tag{8}$$

where for simplicity we write P(n) for $P_M(n \mid v, s)$ with the
parameters estimated using the MLE estimate, which is equivalent to
minimizing the data description length. (We will elaborate on why this
is the case in Subsection 3.3.) Recall that P(n) is obtained by
normalizing the frequencies, i.e.,

$$\forall n \in C, \quad P(n) = \frac{1}{|C|} \times P(C) \tag{9}$$

$$\forall C \in \Gamma, \quad P(C) = \frac{f(C)}{|S|} \tag{10}$$

where f(C) denotes the total frequency of nouns in class C in sample
S, and $\Gamma$ a cut.

7 L(M) depends on S, but we will leave S implicit.

8 One can interpret this as follows. The standard deviation of MLE is
$O(1/\sqrt{|S|})$, and hence the precision required for each parameter
is $O(\log \sqrt{|S|}) = O(\frac{\log |S|}{2})$.

With the description length of a tree cut model 
defined in the above manner, we wish to select a 
model with the minimum description length and 
output it as the result of generalization. Since we 
assume here that every cut has an equal $L_{mod}$, technically we
need only calculate and compare $L'(M) = L_{par}(M) + L_{dat}(M)$ as
the description length of a model. For simplicity, in the sequel we
will sometimes write just $L'$ or $L'(\Gamma)$ for $L'(M)$, where
$\Gamma$ is the tree cut of M.

The description lengths for the data in Fig- 
ure 1 using various tree cut models of the the- 
saurus in Figure 2 are shown in Table 2. (Table 1 
shows how the description length is calculated 
for the cut [BIRD, bug, bee, insect].) These figures 
indicate that the model in Figure 5 is the best 
model, according to MDL. (Note, as we will see in 
Subsection 3.2, that with different co-occurrence 
data, a different tree cut might be optimal.) 

Table 1: Parameters in the model of cut [BIRD, bug, bee, insect]
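
The calculation of Section 3.1 can be sketched numerically as follows. This is an illustration under stated assumptions, not a reproduction of Tables 1 and 2: the counts are reconstructed from the probabilities and totals quoted in the text (the count for 'crow' is inferred), and the names `MEMBERS`, `FREQ` and `description_length` are hypothetical. The function implements equations (7) to (10), with K equal to the number of nodes in the cut minus one.

```python
import math

# Leaves dominated by each class of the Figure 2 thesaurus.
MEMBERS = {
    "ANIMAL": ["swallow", "crow", "eagle", "bird", "bug", "bee", "insect"],
    "BIRD": ["swallow", "crow", "eagle", "bird"],
    "INSECT": ["bug", "bee", "insect"],
}
# Assumed counts for the subject slot of 'fly' (reconstructed, see lead-in).
FREQ = {"swallow": 0, "crow": 2, "eagle": 2, "bird": 4,
        "bug": 0, "bee": 2, "insect": 0}

CUTS = [
    ["ANIMAL"],
    ["BIRD", "INSECT"],
    ["BIRD", "bug", "bee", "insect"],
    ["swallow", "crow", "eagle", "bird", "INSECT"],
    ["swallow", "crow", "eagle", "bird", "bug", "bee", "insect"],
]


def description_length(cut, freq):
    """Return (L_par, L_dat, L') in bits for a tree cut with MLE parameters."""
    size = sum(freq.values())                      # |S|
    l_par = (len(cut) - 1) / 2 * math.log2(size)   # eq. (7), K = nodes - 1
    l_dat = 0.0
    for node in cut:
        nouns = MEMBERS.get(node, [node])          # a leaf dominates itself
        f_c = sum(freq[n] for n in nouns)          # f(C)
        if f_c == 0:
            continue                               # no observations to encode
        p_noun = (f_c / size) / len(nouns)         # eqs. (9) and (10)
        l_dat -= f_c * math.log2(p_noun)           # eq. (8)
    return l_par, l_dat, l_par + l_dat


RESULTS = {tuple(cut): description_length(cut, FREQ) for cut in CUTS}
BEST = min(RESULTS, key=lambda cut: RESULTS[cut][2])

for cut, (l_par, l_dat, total) in RESULTS.items():
    print(f"{list(cut)}: L_par={l_par:.2f} L_dat={l_dat:.2f} L'={total:.2f}")
print("minimum:", list(BEST))
```

Under these assumed counts the minimum falls on [BIRD, INSECT], in agreement with the statement that the model in Figure 5 is the best according to MDL; the margin over [ANIMAL] is small, so the outcome is sensitive to the counts.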

3.2 An Efficient Algorithm 

In generalizing a case slot using MDL, we could in principle calculate
the description length of every possible model and output a model with
the minimum description length as the generalization result, if
computation time were of no concern. But since the number of cuts in a
thesaurus tree of nouns is exponential (for example, for a complete
b-ary tree of depth d it is of order $O(2^{b^{d-1}})$), it is
impractical to do so. Nonetheless, we were able to devise a simple and
efficient algorithm, which is guaranteed to find a model with the
minimum description length.

Table 2: Description lengths of the tree cut models

Figure 6: An example application of Find-MDL. Annotations in the
figure: L'([ARTIFACT]) = 19.3; L'([VEHICLE, AIRPLANE]) = 18.3;
f(crow|.) = 2, f(eagle|.) = 2, f(bird|.) = 4, f(bee|.) = 4,
f(jet|.) = 2, f(airplane|.) = 2

Our algorithm, which we call Find-MDL, recursively finds the optimal
MDL model for each child subtree of a given node, appends all the
optimal models of these subtrees and returns the result, unless
collapsing all the lower-level optimal models into a single node (that
is, a single class) reduces the description length, in which case it
does so. The details of the algorithm are given below. Note that for
simplicity we describe Find-MDL as outputting a cut, rather than a
complete model. (It is implicitly assumed that the parameters are
estimated using MLE.)

Here we let t denote a thesaurus tree, root(t) the root of the tree.
Initially t is set to the entire tree.

algorithm Find-MDL(t) := cut
 1. if
 2.     t is a leaf node
 3. then
 4.     return([t])
 5. else
 6.     For each child tree t_i of t
 7.         c_i := Find-MDL(t_i)
 8.     c := append(c_i)
 9.     if
10.         L'([root(t)]) < L'(c)
11.     then
12.         return([root(t)])
13.     else
14.         return(c)

Figure 7: The algorithm: Find-MDL

Note in the above algorithm that the parameter description length is
calculated as $\frac{K+1}{2} \times \log |S|$, where K + 1 is the
number of nodes in the current cut, both when t is the entire tree and
when it is a proper subtree. This contrasts with the fact that the
number of free parameters is K for the former, while it is K + 1 for
the latter. For the purpose of finding a tree cut with the minimum
description length, however, this distinction can be ignored (cf.
Appendix A).

Figure 6 illustrates how the algorithm works: In the recursive
application of Find-MDL on the subtree rooted at AIRPLANE, the
if-clause on line 10 evaluates to true since L'([AIRPLANE]) = 16.3 and
L'([jet, helicopter, airplane]) = 18, and hence [AIRPLANE] is
returned. Then in the call to Find-MDL on the subtree rooted at
ARTIFACT, the same if-clause evaluates to false since L'([ARTIFACT]) =
19.3 and L'([VEHICLE, AIRPLANE]) = 18.3, and hence [VEHICLE, AIRPLANE]
is returned.

Concerning the above algorithm, we show that 
the following proposition holds, whose proof can 
be found in Appendix A. 

Proposition 1 The algorithm Find-MDL terminates in time O(N x |S|),
where N denotes the number of leaf nodes in the input thesaurus T and
|S| denotes the input sample size, and outputs a tree cut model of T
with the minimum description length.
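
The recursion of Find-MDL can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class and the helper names are hypothetical, the two leaves under VEHICLE ('car' and 'bus', with zero frequency) are invented placeholders because the figure itself is not available, and the sample size of 16 is taken as the sum of the frequencies annotated in Figure 6. The parameter description length uses (number of nodes in the cut)/2 x log|S|, as in the note following the algorithm. For brevity the sketch recomputes subtree totals on each call, so it does not attain the running time stated in Proposition 1.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    freq: int = 0  # observed frequency; used for leaf nodes only


def leaf_count(node):
    if not node.children:
        return 1
    return sum(leaf_count(c) for c in node.children)


def total_freq(node):
    if not node.children:
        return node.freq
    return sum(total_freq(c) for c in node.children)


def l_prime(cut, sample_size):
    """L'(cut) in bits: parameter plus data description length."""
    l_par = len(cut) / 2 * math.log2(sample_size)
    l_dat = 0.0
    for node in cut:
        f = total_freq(node)
        if f > 0:
            p_noun = (f / sample_size) / leaf_count(node)
            l_dat -= f * math.log2(p_noun)
    return l_par + l_dat


def find_mdl(t, sample_size):
    """Return the cut of subtree t chosen by the Find-MDL recursion."""
    if not t.children:
        return [t]
    c = []
    for child in t.children:
        c.extend(find_mdl(child, sample_size))
    if l_prime([t], sample_size) < l_prime(c, sample_size):
        return [t]  # collapsing into a single class is shorter
    return c


# ARTIFACT subtree; 'car' and 'bus' are hypothetical zero-frequency leaves.
airplane = Node("AIRPLANE", [Node("jet", freq=2), Node("helicopter"),
                             Node("airplane", freq=2)])
vehicle = Node("VEHICLE", [Node("car"), Node("bus")])
artifact = Node("ARTIFACT", [vehicle, airplane])
SAMPLE_SIZE = 16

result = [n.name for n in find_mdl(artifact, SAMPLE_SIZE)]
print(result)
```

With these inputs the sketch yields the four description lengths quoted for the AIRPLANE and ARTIFACT subtrees (to one decimal place) and returns [VEHICLE, AIRPLANE].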

3.3 Estimation, Generalization and 
MDL 

When a discrete model (a partition $\Gamma$ of the set of nouns
$\mathcal{N}$ in our present context) is fixed, and the estimation
problem involves only the estimation of probability parameters, the
classic maximum likelihood estimation (MLE) is known to be
satisfactory. In particular, the estimation of a word-based model is
one such problem, since the partition is fixed and equals
$\mathcal{N}$. Furthermore, for a fixed discrete model, it is known
that the MLE coincides with MDL: Given data
$S = \{x_i : i = 1, \ldots, m\}$, the MLE estimate $\hat{P}$ maximizes
the 'likelihood' of the data, that is,

$$\hat{P} = \arg\max_{P} \prod_{i=1}^{m} P(x_i). \tag{11}$$

It is easy to see that $\hat{P}$ also satisfies

$$\hat{P} = \arg\min_{P} \sum_{i=1}^{m} -\log P(x_i). \tag{12}$$

This is nothing but the MDL estimate in this case, since
$\sum_{i=1}^{m} -\log P(x_i)$ is the data description length.

When the estimation problem involves model 
selection, i.e. the choice of a tree cut in the 
present context, MDL's behavior significantly de- 
viates from that of MLE. This is because MDL 
insists on minimizing the sum total of the data 
description length and the model description 
length, while MLE is still equivalent to mini- 
mizing the data description length only. So for 
our problem of estimating tree cut models, MDL 
tends to select a cut that is reasonably simple 
yet fits the data quite well, whereas the model 
selected by MLE will be a word-based model, as 
it will always manage to fit the data as well as 
any tree cut model. 

In statistical terms, the superiority of MDL as 
an estimation method is related to the fact which 
we noted earlier that even though a word-based 
model can provide the best fit to the given data, 
the estimation of the parameters is poor as there
are too many parameters to estimate. So MLE, 
when applied on a data set of a modest size, is 
likely to estimate most parameters as zero, and 
thus suffers from the data sparseness problem. 
Note in Table 2, that MDL avoids this problem 
by taking into account the model complexity as 
well as the fit to the data. 

MDL stipulates that the model with the mini- 
mum description length should be selected both 
for data compression and estimation. This in- 
timate connection between estimation and data 
compression can also be thought of as that be- 
tween estimation and generalization, since in or- 
der to compress information, there needs to be 
generalization. In our current problem, this cor- 
responds to the generalization of individual nouns 
present in case frame instances in the data as 
classes of nouns present in a given thesaurus. For 
example, given the thesaurus in Figure 2 and fre- 
quency data in Figure 1, we would like our sys- 
tem to judge that the class 'BIRD' and the word 
'bee' can be the subject of the verb 'fly.' The 
problem of deciding whether to stop generaliz- 
ing at 'BIRD' and 'bee,' or generalizing further 
to 'ANIMAL' has been addressed by a number 
of authors (cf. (Resnik 92; Resnik 93)). Minimization of the total
description length provides a disciplined criterion for doing this.



The remarkable fact about MDL is that theoretical findings (cf.
(Barron & Cover 92; Yamanishi 92)) have indeed verified that MDL, as an
estimation method, is near optimal, 9 in terms of 
the speed of convergence 10 of its estimated mod- 
els to the true model, as the data size increases. 
Thus MDL provides (i) a way of smoothing prob- 
ability parameters to solve the data sparseness 
problem, and at the same time (ii) a way of gen- 
eralizing nouns in the data to noun classes of an 
appropriate level, both as a corollary to the near 
optimal estimation of the distribution of the in- 
put data. 

A frequently asked question in cognitive sci- 
ence is that of why humans learn, and it is be- 
lieved by many that there are two major moti- 
vations: To improve the accessibility of accumu- 
lated knowledge, and to interpret new informa- 
tion (Rumelhart & Norman 78). The fact that 
MDL is suited for both compression and estimation seems to be
evidence in favor of MDL as a possible cognitive model. For example,
our method of generalizing case frames based on MDL will output a
compact representation summarizing the observed data, which is also
near optimal for predicting the acceptability of unseen instances in
the future. Thus we feel that our method is not only mathematically
sound, but also well motivated from the standpoint of cognitive
science.

4 Experimental Results 

4.1 Experiment 1 

First, we extracted head, slot-name, slot-value 
triples from the texts of the tagged Wall Street 
Journal corpus (ACL/DCI CD-ROM1) consist- 
ing of 126,084 sentences, using conventional pat- 
tern matching techniques, then applied the algo- 
rithm Find-MDL to generalize the slot-values of 
the triples. 

When generalizing, we used the noun taxonomy of WordNet (version 1.4)
(Miller et al. 93) as our thesaurus. The noun taxonomy of WordNet has
the structure of a DAG, and each of its (leaf and internal) nodes
stands for a word sense rather than a word, and thus often contains
several words of the same word sense. Since it does not meet the
assumption we made on our thesaurus, we used it in the following
modified form. First, the observed frequency of a word as a slot-value
of a given head and slot-name is equally divided between all the nodes
containing that word. Then the frequency of an internal node is
calculated as the sum of the frequencies of all the nodes it
dominates. Finally we applied our generalization algorithm to the tree
obtained by discarding from the thesaurus those subtrees that are
rooted at any node containing a word that actually occurs as the
slot-value.

Class                                         Prob.   Example words
(object, inanimate_object, physical_object)   0.30

Table 3: An example of generalization result

Figure 8: An example generalization result

9 There is a Bayesian interpretation of MDL: MDL is essentially
equivalent to the posterior mode in the Bayesian terminology. It is
known that in fact, Bayesian posterior mixture is optimal in some
sense, but it is also known that in many cases these two estimates are
approximations of each other (Takeuchi 95).

10 The models selected by MDL converge to the true model
approximately at the rate of 1/K* where K* is the number of parameters
in the true tree cut model, whereas for MLE the rate is 1/N, where N
is the number of leaf nodes.
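
The first two steps of the thesaurus modification described in Section 4.1 (dividing a word's frequency equally among the nodes that contain it, then summing frequencies upward) might be sketched as follows. Everything concrete here is an assumption made for illustration: the node names and the toy word-to-node mapping are invented and are not taken from WordNet, each node is given a single parent for simplicity (how multiple parents in the DAG were handled is not stated in the text), and a node's own share is counted in its total.

```python
from collections import defaultdict


def node_frequencies(word_freq, word_to_nodes, parent):
    """Spread word counts over sense nodes, then accumulate toward the root."""
    direct = defaultdict(float)
    for word, f in word_freq.items():
        nodes = word_to_nodes[word]
        for node in nodes:
            direct[node] += f / len(nodes)  # equal share per containing node

    total = defaultdict(float)
    for node, f in direct.items():
        cur = node
        while cur is not None:              # add the share to every ancestor
            total[cur] += f
            cur = parent.get(cur)
    return dict(total)


# Hypothetical toy data: 'set' is ambiguous between two sense nodes.
WORD_FREQ = {"set": 2, "tv": 2}
WORD_TO_NODES = {"set": ["tv_set", "collection"], "tv": ["tv_set"]}
PARENT = {"tv_set": "artifact", "collection": "group",
          "artifact": "entity", "group": "entity"}

TOTALS = node_frequencies(WORD_FREQ, WORD_TO_NODES, PARENT)
print(TOTALS)
```

In this toy case the ambiguous word contributes one count to each of its two senses, so the sense shared with 'tv' ends up with three of the four observations.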

Table 3 shows an example generalization result for the direct object
slot of the verb 'eat' and the verb 'buy', where (...) denotes a node
in WordNet. Classes with probabilities less than 
the threshold of 0.05 are discarded. Figure 8 
shows the corresponding cut in WordNet for the 
direct object slot of 'eat'. Note, for example, 
that the fact that (animal), (plant) were gen- 
eralized to (life_form) seems reasonable because 
both of these categories are suited for the di- 
rect object slot of 'eat.' On the contrary, (food) 
was not generalized to (substance), which also 
seems correct because not all substances are edi- 
ble. Thus, despite the fact that the employed ex- 
traction method is not noise-free, and word sense 
ambiguities remain after extraction, the general- 
ization result seems to agree with our intuition to 
a satisfactory degree. This is probably because 
the 'noisy' part usually has a small probability and thus tends to be
discarded. This, we believe, is another desirable consequence of using
MDL as our estimation method. Now, since we can tag a plain text with
a high accuracy with current technology (cf. (Church 88)), we can
acquire case frame patterns completely automatically using our
generalization method, and thereby provide useful usage descriptions
to lexicographers.

4.2 Experiment 2 

We conducted another experiment in which we 
used the acquired knowledge to resolve pp- 
attachment ambiguities. First we selected about 
10% of the parsed trees from the parsed Wall 
Street Journal corpus (Penn Tree Bank 1) as test 
data, and used the remainder as training data. 
Then we extracted 181,250 case frames from the training data using
heuristics and extracted 172 (verb, noun1, prep, noun2) patterns from
the test data. We generalized the slot-values of the head, slot-name,
slot-value triples using our method (the Find-MDL algorithm and
WordNet were used in the same manner as in Experiment 1, with the
output threshold set to 0.05) and selectional association based on
Resnik's method. We then used them to disambiguate the 172 patterns.
We also used lexical association proposed by Hindle & Rooth (Hindle &
Rooth 91) to disambiguate those patterns. Although it would be
possible to resolve word sense ambiguities as well, 
we confined ourselves to the structural disam- 
biguation problem at this stage. 

When using our method for disambiguation, we compare
P(noun2 | verb, prep) and P(noun2 | noun1, prep) to determine the
attachment site of (prep, noun2). If the former is larger than the
latter, we attach it to verb, else if the latter is larger than the
former, we attach it to noun1, and otherwise (especially when both are
0), we conclude that we cannot make a decision. Determining the
attachment site in this way is natural and we empirically found that
this gives us the best results in terms of accuracy. When using the
selectional association to disambiguate, we heuristically calculate
the 't-score' of max(A(Class2 | verb, prep)) and
max(A(Class2 | noun1, prep)), where the maximization is over
noun2 ∈ Class2. If the t-score is not significant (at significance
level 95%), we conclude that we cannot make a decision. When using the
lexical association to disambiguate, we calculate the t-score of
P(prep | verb) and P(prep | noun1). Again if the t-score is not
significant, we conclude that we cannot make a decision.
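
The decision rule used with the MDL-based estimates is simple enough to state as code. The sketch below assumes the two conditional probabilities have already been estimated; the function name `attach` and the example values are hypothetical.

```python
from typing import Optional


def attach(p_given_verb: float, p_given_noun1: float) -> Optional[str]:
    """Choose the attachment site of (prep, noun2).

    p_given_verb  stands for P(noun2 | verb, prep)
    p_given_noun1 stands for P(noun2 | noun1, prep)
    Returns None when no decision can be made (ties, including 0 vs 0).
    """
    if p_given_verb > p_given_noun1:
        return "verb"
    if p_given_noun1 > p_given_verb:
        return "noun1"
    return None


print(attach(0.12, 0.03))  # attaches to the verb
print(attach(0.0, 0.0))    # undecided
```

Returning None for ties is what makes coverage fall below 100%: patterns whose estimates are both zero are left to another method.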





          Coverage(%)   Accuracy(%)
Default   100           70.2
LA        87.2          86.0
MDL       49.4          88.2
SA        49.4          84.7
MDL2      65.7          85.8

Table 4: Results of PP-attachment disambiguation



Table 4 shows the results of pp-attachment dis- 
ambiguation in terms of 'coverage' and 'accu- 
racy.' Here 'coverage' refers to the proportion 
(in percentage) of the test patterns on which the 
disambiguation method could make a decision. 
'Default' refers to the method of always attaching (prep, noun2) to
noun1, while 'MDL,' 'SA,' and 'LA' stand for using MDL, selectional
association, and lexical association, respectively.
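
The two evaluation measures can be sketched as below. Coverage follows the definition given in the text; the text does not spell out how accuracy is computed, so this sketch assumes it is the percentage of correct decisions among the patterns on which a decision was made. The function name and the toy inputs are hypothetical and are not data from the experiment.

```python
def coverage_and_accuracy(decisions, gold):
    """Return (coverage %, accuracy %) for a list of attachment decisions.

    decisions: 'verb', 'noun1', or None when the method made no decision
    gold:      the correct attachment site for each test pattern
    """
    decided = [(d, g) for d, g in zip(decisions, gold) if d is not None]
    coverage = 100.0 * len(decided) / len(gold)
    if not decided:
        return coverage, 0.0
    correct = sum(1 for d, g in decided if d == g)
    accuracy = 100.0 * correct / len(decided)  # assumed: among decided only
    return coverage, accuracy


# Toy example with four test patterns, one of them undecided.
cov, acc = coverage_and_accuracy(
    ["verb", None, "noun1", "noun1"],
    ["verb", "noun1", "verb", "noun1"],
)
print(f"coverage={cov:.1f} accuracy={acc:.1f}")
```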

Here are some points that are worth noting about these results.
First, the coverage of LA is larger than those of both MDL and SA 11;
we believe that this is mainly because it uses a model that is not as
rich as those of MDL and SA, and thus needs less data to estimate its
parameters. However, as Resnik correctly pointed out, if we hope to
improve the performance of disambiguation as we get larger data sizes,
we need a richer model such as those used in MDL and SA.



Second, the accuracy of MDL is better than that of SA, while its
coverage is the same as that of SA. MDL tends to generalize only when
there is enough evidence, and when it does, the result seems to fit
human intuition quite well. Table 5 shows an example generalization
result for the 'on' slot for the verb 'watch'. Note that MDL does not
generalize 'afternoon' because of its small frequency, while it has
been generalized to 'acknowledgement' by SA, which seems rather
odd. 12



Input                            Freq.
watch on afternoon               1
watch on screen                  1
watch on set                     2
watch on street                  1
watch on television              2
watch on tv                      2

Output of MDL                    Prob.
watch on (entity)                0.59

Output of SA                     SA
watch on (television,...)        1.78
watch on (artifact,...)          1.43
watch on (acknowledgment,...)    0.81
watch on (afternoon)             0.50

Table 5: An example generalization result

We also conducted the following additional
experiment. We randomly selected 50%, 60%, 70%,
80%, and 90% of the training data and applied
MDL and SA to them, repeated this process ten
times, and then evaluated the accuracy and
coverage, averaged over the ten trials. Figures 9 and
10 show the results of this experiment. We found
that MDL outperforms SA throughout in terms
of accuracy, and that its coverage improves faster
than that of SA as the data size increases.
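
The resampling procedure described above can be sketched as follows. This is only an illustration under stated assumptions: sampling is taken to be without replacement, and `evaluate` is a hypothetical callback that trains a method on the subset and returns an (accuracy, coverage) pair on the fixed test data. Neither detail is specified in the text.

```python
import random
from typing import Callable, Sequence, Tuple


def averaged_over_trials(
    train: Sequence,
    fraction: float,
    evaluate: Callable[[list], Tuple[float, float]],
    trials: int = 10,
    seed: int = 0,
) -> Tuple[float, float]:
    """Average (accuracy, coverage) over repeated random subsamples."""
    rng = random.Random(seed)
    size = round(fraction * len(train))
    acc_sum = 0.0
    cov_sum = 0.0
    for _ in range(trials):
        # Draw a random subset of the training data (without replacement).
        subset = rng.sample(list(train), size)
        acc, cov = evaluate(subset)
        acc_sum += acc
        cov_sum += cov
    return acc_sum / trials, cov_sum / trials
```

Calling this once per fraction (0.5, 0.6, ..., 0.9) for each method would yield the points plotted in curves of the kind shown in Figures 9 and 10.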



Figure 9: Accuracy of MDL and SA



11 Our result on LA is close to Hindle's, but
deviates from Resnik's, probably because of the
different data used.
12 Note that 'afternoon' does not belong to
'(entity),' and that (some word sense of it) lies within
'(acknowledgment),' as in 'good afternoon.'




Figure 10: Coverage of MDL and SA



Admittedly, the coverage of MDL (and of SA)
is not satisfactory. So we conducted another
additional experiment, in which we also generalized
the head of the triples in the data, provided the
head is a noun. When disambiguating, we compared
P(noun2 | verb, prep) and P(noun2 | noun1, prep).
The result of this additional experiment is shown
in Table 4 as MDL2, which indicates that the
coverage can be significantly improved this way,
although the accuracy drops somewhat. In fact,
the method (see footnote 13) used in the actual
experiment in (Resnik 92) employs the version of
SA in which both slot-value and head are
generalized (provided the head is a noun). This
seems to have the effect of improving the coverage,
since the reported coverage (67.2%) is better than
that for the version of SA used here, which
generalizes the slot-value only, while the accuracy
(84.6%) is the same as what we obtained.
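
The comparison of the two conditional probabilities can be sketched as a small decision function. The rule for abstaining (no decision when the two estimates are equal, including when both are zero) is an assumption, as are the function names and the shape of the `prob` callback; the text only states that the two probabilities are compared.

```python
from typing import Callable, Optional

# Hypothetical interface: prob(word, head, prep) estimates P(word | head, prep).
CondProb = Callable[[str, str, str], float]


def disambiguate(
    verb: str, noun1: str, prep: str, noun2: str, prob: CondProb
) -> Optional[str]:
    """Attach (prep, noun2) to the verb or to noun1, or abstain with None."""
    p_verb = prob(noun2, verb, prep)
    p_noun = prob(noun2, noun1, prep)
    if p_verb > p_noun:
        return "verb"
    if p_noun > p_verb:
        return "noun1"
    # Equal estimates (for example, both zero): no decision (assumed rule).
    return None
```

Patterns on which this function returns None are exactly those that lower the coverage figure, which is why generalizing the head as well can raise coverage.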

Finally, we tested the method 'Combined',
which applies MDL first, then applies LA
to the rest, and finally uses Default on what
remains. We also tested 'Combined2', which
applies MDL2 first, then LA, and finally Default.
Table 6 shows the results of this experiment.
Our final accuracy (84.9%) is better than
that (78.3%) reported in (Hindle & Rooth 91)
and that (82.2%) reported in (Resnik 92). We
conclude that our method improves upon the
existing methods, although the statistical
significance of the improvement is moderate. (The
standard deviations for these three figures are
2.7%, 1.4%, and 2.9%, respectively.)
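
The back-off scheme behind 'Combined' and 'Combined2' can be sketched as a cascade: each method is tried in order, and the first one that makes a decision wins. The type aliases and function names below are hypothetical, and each method is assumed to return None when it cannot decide.

```python
from typing import Callable, Optional, Sequence, Tuple

Pattern = Tuple[str, str, str, str]  # (verb, noun1, prep, noun2)
Method = Callable[[Pattern], Optional[str]]


def default(pattern: Pattern) -> str:
    """Default: always attach (prep, noun2) to noun1."""
    return "noun1"


def combined(pattern: Pattern, methods: Sequence[Method]) -> str:
    """Try each method in order; fall back to Default if all abstain."""
    for method in methods:
        decision = method(pattern)
        if decision is not None:
            return decision
    return default(pattern)
```

With `methods=[mdl, la]` this corresponds to 'Combined', and with `methods=[mdl2, la]` to 'Combined2', where `mdl`, `mdl2`, and `la` stand for implementations not shown here. Because Default always decides, the cascade has 100% coverage by construction.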





             Coverage(%)    Accuracy(%)
Combined     100            84.3
Combined2    100            84.9

Table 6: Final Results of PP-attachment disambiguation



13 We did not implement the exact method actually 
used in (Resnik 92). 



5 Conclusions 

We proposed a new method of generalizing case
frames. We believe that our method has the
following merits: (1) it is theoretically sound; (2) it
is well motivated from the standpoint of cognitive
science; (3) it is computationally efficient; and
(4) it is robust against noise. The disadvantage of
our method is that its performance depends on
the structure of the particular thesaurus used.
This, however, is a problem shared by any
generalization method that uses a thesaurus as prior
knowledge. Our experimental results indicate that
the performance of our method is better than, or
at least as good as, that of existing methods.

A Proof of Proposition 1

For an arbitrary subtree $T'$ of a thesaurus tree
$T$ and an arbitrary tree cut model $M$ of $T$, let
$M \cap T'$ denote the submodel of $M$ that is
contained in $T'$. Also, for any sample $S$ and any
subtree $T'$, let $S \cap T'$ denote the subsample of $S$
contained in $T'$. Then define (see footnote 14)
$L_{dat}(M', S')$ to be the data description length of
(sub)sample $S'$ using (sub)model $M'$,
$L_{par}(M', |S|)$ the parameter description length for
the parameters in (sub)model $M'$ with (total)
sample size $|S|$, and finally
$L'(M', S', |S|) = L_{dat}(M', S') + L_{par}(M', |S|)$ in
general for any (sub)model $M'$ and (sub)sample
$S'$ of $S$.

First note that for any (sub)tree $T$, (sub)model
$M \cap T$, (sub)sample $S \cap T$, and $T$'s child subtrees
$T_i : i = 1, \ldots, k$, we have

$$L_{dat}(M \cap T, S \cap T) = \sum_{i=1,\ldots,k} L_{dat}(M \cap T_i, S \cap T_i). \tag{13}$$

This follows from the mutual disjointness of the
$T_i$ and the independence of the parameters in
the $T_i$. We also have, when $T$ is a proper subtree
of the entire thesaurus tree,

$$L_{par}(M \cap T, |S|) = \sum_{i=1,\ldots,k} L_{par}(M \cap T_i, |S|). \tag{14}$$

Since the number of free parameters of a model
in the entire thesaurus tree equals the number
of nodes in the model minus one due to the
stochastic condition (that the probability
parameters must sum to one), when $T$ equals the entire
thesaurus tree, theoretically the parameter
description length for a tree cut model of $T$ should
be

$$L_{par}(M \cap T, |S|) = \sum_{i=1,\ldots,k} L_{par}(M \cap T_i, |S|) - \frac{\log |S|}{2}. \tag{15}$$

Since the second term $-\frac{\log |S|}{2}$ in (15) is constant
once the input sample $S$ is fixed, it is irrelevant
for the purpose of finding a model with the
minimum description length. We will thus use the
identity (14) both when $T$ is the entire tree and
when it is a proper subtree. (This allows us to
use the same recursive algorithm (Find-MDL) in
all cases.)

It follows from (13) and (14) that the
minimization of description length can be done
essentially independently for each subtree. Namely, if
we let $L'_{opt}(M \cap T, S \cap T, |S|)$ denote the minimum
description length achievable for the (sub)model

14 Note that in Section 3 $L_{dat}$, $L_{par}$ and $L'$ were
defined as functions of one argument, leaving the
dependency on the sample implicit. Here we make it
explicit, as the sample does not always equal $S$.



$M \cap T$ on the (sub)sample $S \cap T$, $P_S(\eta)$ the MLE
for node $\eta$ using sample $S$, and root($T$)
the root node of (sub)tree $T$, then we have

$$L'_{opt}(M \cap T, S \cap T, |S|) = \min\Big\{ L'\big(([\mathrm{root}(T)], [P_S(\mathrm{root}(T))]), S \cap T, |S|\big),\ \sum_{i=1,\ldots,k} L'_{opt}(M \cap T_i, S \cap T_i, |S|) \Big\}. \tag{16}$$

The rest of the proof proceeds by induction.
First, when $T$ consists of a single leaf node, the
MLE for the class represented by $T$ is returned,
which is known to minimize the data description
length. (Clearly, the parameter description
length is identical for all.) Next, inductively
assume that Find-MDL($T'$) correctly outputs a
model with the minimum description length for
any tree $T'$ of size less than $n$. Then, given a
tree $T$ of size $n$ whose root node has at least
two children, say $T_i : i = 1, \ldots, k$, for each $T_i$,
Find-MDL($T_i$) returns a model with the minimum
description length by the inductive hypothesis.
Then, since (16) holds, whichever way the
if-clause on lines 9, 10 of Find-MDL evaluates,
what is returned on line 12 or line 14 will still be
a model with the minimum description length,
completing the inductive step. It is easy to see
that the running time of the algorithm is linear
in both the size of the input thesaurus tree and
the sample size. □
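
The recursion in (16) can be illustrated with a minimal Python sketch. It is not the Find-MDL pseudocode referred to in the proof, only a reconstruction under stated assumptions: the class probability (its MLE) is shared uniformly among the words of the class, each node of a cut contributes (1/2) log |S| to the parameter description length (that is, identity (14) is used at every level, as the proof allows), and logarithms are taken to base 2. The `Node` class, the function names, and the toy thesaurus and counts are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Words (leaf names) dominated by this node."""
        if not self.children:
            return [self.name]
        return [w for c in self.children for w in c.leaves()]


def description_length(cut: List[Node], freq: Dict[str, int], total: int) -> float:
    """L'(cut) = L_dat + L_par, with `total` the full sample size |S|."""
    l_par = len(cut) / 2.0 * math.log2(total)  # (1/2) log|S| per node
    l_dat = 0.0
    for node in cut:
        words = node.leaves()
        count = sum(freq.get(w, 0) for w in words)
        if count > 0:
            # MLE of the class probability, divided evenly among its words.
            p_word = count / total / len(words)
            l_dat -= count * math.log2(p_word)
    return l_dat + l_par


def find_mdl(t: Node, freq: Dict[str, int], total: int) -> List[Node]:
    """Return a cut of the subtree rooted at t, following recursion (16)."""
    if not t.children:
        return [t]
    # Best cut of each child subtree, concatenated.
    cut = [n for child in t.children for n in find_mdl(child, freq, total)]
    if description_length([t], freq, total) < description_length(cut, freq, total):
        return [t]
    return cut


# Toy data, made up for illustration only.
bird = Node("BIRD", [Node("swallow"), Node("crow"), Node("eagle"), Node("bird")])
insect = Node("INSECT", [Node("bug"), Node("bee"), Node("insect")])
animal = Node("ANIMAL", [bird, insect])
freq = {"crow": 2, "eagle": 2, "bird": 4, "bee": 2}

best = find_mdl(animal, freq, total=sum(freq.values()))
print([n.name for n in best])  # ['BIRD', 'INSECT']
```

On this toy input the two subtrees are each collapsed to a single class, while the root is not, because merging BIRD and INSECT would cost slightly more in data description length than it saves in parameters. Note that this sketch recomputes leaf sets at every call, so it does not achieve the linear running time claimed in the proof; precomputing per-node counts would be needed for that.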